Incorporating a Covariate (as a main and linear effect) into our two-way ANOVA models
Add fruit_day to m_1
m_2 <- lm(bmi ~ fruit_day + exerany + health, data = train_c4im)
How well does this model fit the training data?
bind_rows(glance(m_1), glance(m_2)) |>
  mutate(mod = c("m_1", "m_2")) |>
  select(mod, r.sq = r.squared, adj.r.sq = adj.r.squared,
         sigma, df, df.res = df.residual, AIC, BIC) |>
  gt() |> tab_options(table.font.size = 20)
mod  r.sq        adj.r.sq    sigma     df  df.res  AIC       BIC
m_1  0.08890746  0.08204682  6.115123   5     664  4335.776  4367.327
m_2  0.09790484  0.08974108  6.089441   6     663  4331.126  4367.184
ANOVA for the m_2 model
tidy(anova(m_2)) |> gt() |> tab_options(table.font.size = 20)
term        df  sumsq       meansq     statistic  p.value
fruit_day    1    467.1274  467.12739  12.597388  4.135820e-04
exerany      1    761.3825  761.38250  20.532794  6.953235e-06
health       4   1439.7012  359.92529   9.706385  1.239354e-07
Residuals  663  24584.8954   37.08129         NA  NA
m_2 coefficients
tidy(m_2, conf.int = TRUE, conf.level = 0.90) |>
  gt() |> tab_options(table.font.size = 20)
term         estimate    std.error  statistic   p.value        conf.low    conf.high
(Intercept)  28.6715422  0.7979500  35.9315038  9.237344e-158  27.3571948  29.9858896
fruit_day    -0.5458806  0.2122799  -2.5715128  1.034232e-02   -0.8955386  -0.1962226
exerany      -2.0476701  0.5509276  -3.7167681  2.188237e-04   -2.9551334  -1.1402069
healthVG      0.5607594  0.7000384   0.8010409  4.233949e-01   -0.5923125   1.7138313
healthG       3.0052279  0.7213012   4.1663980  3.504237e-05    1.8171329   4.1933229
healthF       3.5515381  0.9053858   3.9226793  9.669720e-05    2.0602272   5.0428490
healthP       4.5620682  1.3519338   3.3744760  7.826611e-04    2.3352234   6.7889129
m_2 Residuals
par(mfrow = c(1, 2)); plot(m_2, which = c(1, 2))
m_2 Residuals
par(mfrow = c(1, 2)); plot(m_2, which = c(3, 5))
Who is that poorest fit case?
Plot suggests we look at row 28
train_c4im |> slice(28) |>
  select(ID, bmi, fruit_day, exerany, health) |>
  gt() |> tab_options(table.font.size = 20)
ID   bmi  fruit_day  exerany  health
320   63          1        0  P
What is unusual about this subject?
train_c4im |> arrange(desc(bmi))
# A tibble: 670 × 9
ID bmi inc_imp fruit_day drinks_wk female exerany health race_eth
<chr> <dbl> <dbl> <dbl> <dbl> <int> <int> <fct> <fct>
1 320 63 20581 1 0.7 1 0 P Black non-Hisp…
2 959 59.0 5720 0.1 0 1 0 F White non-Hisp…
3 1078 56.3 43948 1.02 0 1 0 F White non-Hisp…
4 633 51.5 39964 1 0 0 0 G White non-Hisp…
5 947 51.2 10280 2.02 0.93 1 0 F White non-Hisp…
6 535 50.5 22307 0.46 0 0 1 G Hispanic
7 888 50.0 113725 2.02 0 1 1 G White non-Hisp…
8 743 49.0 31469 0.13 0 0 0 G Black non-Hisp…
9 551 48.8 91510 1.14 0.93 0 1 G White non-Hisp…
10 924 48.2 11825 2 0 1 1 P Multiracial no…
# ℹ 660 more rows
Include the interaction term?
m_2int <- lm(bmi ~ fruit_day + exerany * health,
             data = train_c4im)
ANOVA for the m_2int model
tidy(anova(m_2int)) |> gt() |> tab_options(table.font.size = 20)
term            df  sumsq       meansq     statistic  p.value
fruit_day        1    467.1274  467.12739  13.096273  3.185537e-04
exerany          1    761.3825  761.38250  21.345939  4.612955e-06
health           4   1439.7012  359.92529  10.090780  6.231466e-08
exerany:health   4   1079.2034  269.80084   7.564072  5.829277e-06
Residuals      659  23505.6920   35.66873         NA  NA
m_2int coefficients
tidy(m_2int, conf.int = TRUE, conf.level = 0.90) |>
  rename(se = std.error, t = statistic, p = p.value) |>
  gt() |> tab_options(table.font.size = 18)
term              estimate    se        t           p             conf.low      conf.high
(Intercept)       28.2736467  1.433168  19.7280703  1.964272e-68   25.91297599  30.6343175
fruit_day         -0.6101570  0.208728  -2.9232160  3.583167e-03   -0.95396723  -0.2663467
exerany           -1.4458021  1.541761  -0.9377602  3.487112e-01   -3.98534313   1.0937390
healthVG          -0.6573334  1.625530  -0.4043810  6.860638e-01   -3.33485589   2.0201891
healthG            2.7490624  1.611665   1.7057285  8.852985e-02    0.09437803   5.4037469
healthF            7.5895769  1.769876   4.2881981  2.070809e-05    4.67429252  10.5048613
healthP            9.0747824  2.541361   3.5708356  3.817403e-04    4.88873094  13.2608338
exerany:healthVG   1.5952539  1.793513   0.8894576  3.740817e-01   -1.35896550   4.5494733
exerany:healthG    0.4226432  1.794327   0.2355441  8.138596e-01   -2.53291698   3.3782033
exerany:healthF   -6.4107251  2.062051  -3.1089073  1.958699e-03   -9.80727160  -3.0141785
exerany:healthP   -6.4994464  2.993295  -2.1713351  3.026195e-02  -11.42990959  -1.5689833
ANOVA: Compare m_2 & m_2int
Analysis of Variance Table
Model 1: bmi ~ fruit_day + exerany + health
Model 2: bmi ~ fruit_day + exerany * health
Res.Df RSS Df Sum of Sq F Pr(>F)
1 663 24585
2 659 23506 4 1079.2 7.5641 5.829e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m_2int Residuals
par(mfrow = c(1, 2)); plot(m_2int, which = c(1, 2))
m_2int Residuals
par(mfrow = c(1, 2)); plot(m_2int, which = c(3, 5))
Which of the four models fits best?
In the training sample, we have…
mod     r.sq        adj.r.sq    sigma     df  df.res  AIC       BIC
m_1     0.08890746  0.08204682  6.115123   5     664  4335.776  4367.327
m_2     0.09790484  0.08974108  6.089441   6     663  4331.126  4367.184
m_1int  0.12632022  0.11440640  6.006371   9     660  4315.682  4365.262
m_2int  0.13750412  0.12441617  5.972330  10     659  4309.050  4363.137
Adjusted \(R^2\), \(\sigma\), AIC and BIC all improve as we move down the table from m_1 to m_2int.
BUT the training sample cannot judge accurately between models: our models have already seen those data.
Predicting bmi in the test sample
For fairer comparisons, consider the held-out sample.
We’ll use augment from the broom package…
m1_test_aug <- augment(m_1, newdata = test_c4im)
m1int_test_aug <- augment(m_1int, newdata = test_c4im)
m2_test_aug <- augment(m_2, newdata = test_c4im)
m2int_test_aug <- augment(m_2int, newdata = test_c4im)
This adds fitted values (predictions) and residuals (errors) …
m1_test_aug |> select(ID, bmi, .fitted, .resid) |>
  slice(1:2) |> gt() |> tab_options(table.font.size = 20)
ID  bmi    .fitted   .resid
4   26.51  28.84456  -2.334562
5   24.25  28.84456  -4.594562
The yardstick package
For each subject in the testing set, we will need:
estimate = model’s prediction of that subject’s bmi
truth = the bmi value observed for that subject
Calculate a summary of the predictions across the \(n\) test subjects
Summaries from yardstick
\(R^2\) = square of the correlation between truth and estimate
mae = mean absolute error …
\[
mae = \frac{1}{n} \sum{|truth - estimate|}
\]
rmse = root mean squared error …
\[
rmse = \sqrt{\frac{1}{n} \sum{(truth - estimate)^2}}
\]
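These summaries are easy to verify by hand. Here is a quick sketch (with made-up truth and estimate vectors, not from our data) showing that base R reproduces each formula:

truth    <- c(24, 30, 27, 35)
estimate <- c(25, 28, 29, 31)

cor(truth, estimate)^2            ## rsq: the squared correlation
mean(abs(truth - estimate))       ## mae: here, mean of 1, 2, 2, 4 = 2.25
sqrt(mean((truth - estimate)^2))  ## rmse: here, sqrt(6.25) = 2.5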
Testing Results (using \(R^2\) )
We can use the yardstick package and its rsq() function.
testing_r2 <- bind_rows(
  rsq(m1_test_aug, truth = bmi, estimate = .fitted),
  rsq(m1int_test_aug, truth = bmi, estimate = .fitted),
  rsq(m2_test_aug, truth = bmi, estimate = .fitted),
  rsq(m2int_test_aug, truth = bmi, estimate = .fitted)) |>
  mutate(model = c("m_1", "m_1int", "m_2", "m_2int"))

testing_r2 |> gt() |> tab_options(table.font.size = 20)
.metric  .estimator  .estimate   model
rsq      standard    0.07161065  m_1
rsq      standard    0.03973444  m_1int
rsq      standard    0.06517957  m_2
rsq      standard    0.03635748  m_2int
Mean Absolute Error?
Consider the mean absolute prediction error …
testing_mae <- bind_rows(
  mae(m1_test_aug, truth = bmi, estimate = .fitted),
  mae(m1int_test_aug, truth = bmi, estimate = .fitted),
  mae(m2_test_aug, truth = bmi, estimate = .fitted),
  mae(m2int_test_aug, truth = bmi, estimate = .fitted)) |>
  mutate(model = c("m_1", "m_1int", "m_2", "m_2int"))

testing_mae |> gt() |> tab_options(table.font.size = 20)
.metric  .estimator  .estimate  model
mae      standard    4.428381   m_1
mae      standard    4.623948   m_1int
mae      standard    4.476413   m_2
mae      standard    4.706638   m_2int
Root Mean Squared Error?
How about the square root of the mean squared prediction error, or RMSE?
testing_rmse <- bind_rows(
  rmse(m1_test_aug, truth = bmi, estimate = .fitted),
  rmse(m1int_test_aug, truth = bmi, estimate = .fitted),
  rmse(m2_test_aug, truth = bmi, estimate = .fitted),
  rmse(m2int_test_aug, truth = bmi, estimate = .fitted)) |>
  mutate(model = c("m_1", "m_1int", "m_2", "m_2int"))

testing_rmse |> gt() |> tab_options(table.font.size = 20)
.metric  .estimator  .estimate  model
rmse     standard    5.728520   m_1
rmse     standard    6.025257   m_1int
rmse     standard    5.769484   m_2
rmse     standard    6.082435   m_2int
Other yardstick summaries (1)
rsq_trad() = defines \(R^2\) using sums of squares.
The rsq() measure we showed a few slides ago is a squared correlation coefficient, guaranteed to fall in [0, 1].
mape() = mean absolute percentage error
mpe() = mean percentage error
Other yardstick summaries (2)
huber_loss() = Huber loss (often used in robust regression), which is less sensitive to outliers than rmse().
ccc() = concordance correlation coefficient, which attempts to measure both consistency/correlation (like rsq()) and accuracy (like rmse()).
See the yardstick home page for more details.
Incorporating Non-Linearity into our models
Incorporating a non-linear term for fruit_day
Suppose we wanted to include a polynomial term for fruit_day:
lm(bmi ~ fruit_day, data = train_c4im)
lm(bmi ~ poly(fruit_day, 2), data = train_c4im)
lm(bmi ~ poly(fruit_day, 3), data = train_c4im)
Polynomial Regression
A polynomial in the variable x of degree D is a linear combination of the powers of x up to D.
For example:
Linear: \(y = \beta_0 + \beta_1 x\)
Quadratic: \(y = \beta_0 + \beta_1 x + \beta_2 x^2\)
Cubic: \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\)
Quartic: \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4\)
Quintic: \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \beta_5 x^5\)
Fitting such a model creates a polynomial regression .
Raw Polynomials vs. Orthogonal Polynomials
Predict bmi using fruit_day with a polynomial of degree 2.
(temp1 <- lm(bmi ~ fruit_day + I(fruit_day^2),
             data = train_c4im))
Call:
lm(formula = bmi ~ fruit_day + I(fruit_day^2), data = train_c4im)
Coefficients:
(Intercept) fruit_day I(fruit_day^2)
29.5866 -1.2726 0.1052
This uses raw polynomials. Predicted bmi for fruit_day = 2 is
bmi = 29.5866 - 1.2726 (fruit_day) + 0.1052 (fruit_day^2)
    = 29.5866 - 1.2726 (2) + 0.1052 (4)
    = 27.462
Does the raw polynomial match our expectations?
temp1 <- lm(bmi ~ fruit_day + I(fruit_day^2),
            data = train_c4im)

augment(temp1, newdata = tibble(fruit_day = 2)) |>
  gt() |> tab_options(table.font.size = 20)
fruit_day  .fitted
2          27.46205
and this matches our “by hand” calculation. But it turns out most regression models use orthogonal rather than raw polynomials…
Fitting an Orthogonal Polynomial
Predict bmi using fruit_day with an orthogonal polynomial of degree 2.
(temp2 <- lm(bmi ~ poly(fruit_day, 2), data = train_c4im))
Call:
lm(formula = bmi ~ poly(fruit_day, 2), data = train_c4im)
Coefficients:
(Intercept) poly(fruit_day, 2)1 poly(fruit_day, 2)2
28.084 -21.613 8.011
This looks very different from our previous version of the model.
What happens when we make a prediction, though?
Prediction in the Orthogonal Polynomial Model
Remember that in our raw polynomial model, our “by hand” and “using R” calculations both concluded that the predicted bmi for a subject with fruit_day = 2 was about 27.462.
Now, what happens with the orthogonal polynomial model temp2 we just fit?
augment(temp2, newdata = data.frame(fruit_day = 2)) |>
  gt() |> tab_options(table.font.size = 20)
fruit_day  .fitted
2          27.46205
No change in the prediction.
Fits of raw vs orthogonal polynomials
The two models are, in fact, identical.
Why do we use orthogonal polynomials?
The main reason is to avoid having to include powers of our predictor that are highly collinear.
Variance Inflation Factor assesses collinearity…
vif(temp1) ## from the rms package
fruit_day I(fruit_day^2)
4.652178 4.652178
Orthogonal polynomial terms are uncorrelated with one another, easing the process of identifying which terms add value to our model.
vif(temp2)

poly(fruit_day, 2)1 poly(fruit_day, 2)2
                  1                   1
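We can also see the collinearity point directly. Here is a sketch (assuming fruit_day has no missing values, as in our imputed training data): the columns that poly() builds are orthogonal, so their correlation is essentially zero, while a variable and its square are typically strongly correlated.

x <- train_c4im$fruit_day
round(cor(poly(x, 2)), 3)   ## off-diagonal entries are essentially 0
cor(x, x^2)                 ## typically far from 0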
Why orthogonal rather than raw polynomials?
The tradeoff is that a raw polynomial is easier to explain, since its coefficients correspond directly to a single equation.
In practice, we’ll usually avoid polynomials altogether and use splines instead. Splines are more flexible and require less maintenance, but they essentially require you to focus on visualizing the model’s predictions rather than interpreting its equation.
Adding a Second Order Polynomial to our Models
m_3 <- lm(bmi ~ poly(fruit_day, 2) + exerany + health,
          data = train_c4im)
Comparison to other models without the interaction…
mod  r.sq        adj.r.sq    sigma     df  df.res  AIC       BIC
m_1  0.08890746  0.08204682  6.115123   5     664  4335.776  4367.327
m_2  0.09790484  0.08974108  6.089441   6     663  4331.126  4367.184
m_3  0.09794577  0.08840743  6.093900   7     662  4333.096  4373.661
Tidied summary of m_3 coefficients
term                 est          se         t           p              conf.low     conf.high
(Intercept)           27.8644132  0.7424070  37.5325307  4.410943e-166   26.6415511  29.087275
poly(fruit_day, 2)1  -15.8554311  6.1642322  -2.5721664  1.032334e-02   -26.0088995  -5.701963
poly(fruit_day, 2)2    1.0812412  6.2385699   0.1733156  8.624564e-01    -9.1946732  11.357156
exerany               -2.0319807  0.5587135  -3.6368924  2.973984e-04    -2.9522704  -1.111691
healthVG               0.5606014  0.7005517   0.8002285  4.238655e-01    -0.5933183   1.714521
healthG                3.0024230  0.7220108   4.1584183  3.626319e-05     1.8131567   4.191689
healthF                3.5528349  0.9060797   3.9211064  9.733049e-05     2.0603779   5.045292
healthP                4.5325003  1.3636377   3.3238304  9.368376e-04     2.2863728   6.778628
m_3 Residuals
par(mfrow = c(1, 2)); plot(m_3, which = c(1, 2))
m_3 Residuals
par(mfrow = c(1, 2)); plot(m_3, which = c(3, 5))
Add in the interaction
m_3int <- lm(bmi ~ poly(fruit_day, 2) + exerany * health,
             data = train_c4im)
Comparison to other models with the interaction…
mod     r.sq       adj.r.sq   sigma     df  df.res  AIC       BIC
m_1int  0.1263202  0.1144064  6.006371   9     660  4315.682  4365.262
m_2int  0.1375041  0.1244162  5.972330  10     659  4309.050  4363.137
m_3int  0.1375923  0.1231751  5.976561  11     658  4310.982  4369.576
Tidied summary of m_3int coefficients
term                 est          se        t           p             conf.low      conf.high
(Intercept)           27.3965514  1.410188  19.4275830  8.553086e-67   25.07372766  29.719375
poly(fruit_day, 2)1  -17.6962306  6.060147  -2.9200991  3.618914e-03  -27.67833946  -7.714122
poly(fruit_day, 2)2   -1.6306343  6.287045  -0.2593642  7.954354e-01  -11.98648356   8.725215
exerany               -1.4683718  1.545305  -0.9502147  3.423520e-01   -4.01375639   1.077013
healthVG              -0.6579448  1.626683  -0.4044702  6.859984e-01   -3.33737270   2.021483
healthG                2.7455162  1.612864   1.7022610  8.917878e-02    0.08884996   5.402182
healthF                7.5801726  1.771501   4.2789559  2.156689e-05    4.66220539  10.498140
healthP                9.2235030  2.607003   3.5379723  4.315760e-04    4.92931949  13.517686
exerany:healthVG       1.5963008  1.794788   0.8894090  3.741084e-01   -1.36002531   4.552627
exerany:healthG        0.4332807  1.796067   0.2412387  8.094453e-01   -2.52515103   3.391713
exerany:healthF       -6.3985866  2.064042  -3.1000268  2.017633e-03   -9.79842072  -2.998752
exerany:healthP       -6.6523428  3.052872  -2.1790438  2.968244e-02  -11.68095081  -1.623735
m_3int Residuals
par(mfrow = c(1, 2)); plot(m_3int, which = c(1, 2))
m_3int Residuals
par(mfrow = c(1, 2)); plot(m_3int, which = c(3, 5))
How do models m_3 and m_3int do in testing?
m3_test_aug <- augment(m_3, newdata = test_c4im)
m3int_test_aug <- augment(m_3int, newdata = test_c4im)

testing_r2 <- bind_rows(
  rsq(m1_test_aug, truth = bmi, estimate = .fitted),
  rsq(m2_test_aug, truth = bmi, estimate = .fitted),
  rsq(m3_test_aug, truth = bmi, estimate = .fitted),
  rsq(m1int_test_aug, truth = bmi, estimate = .fitted),
  rsq(m2int_test_aug, truth = bmi, estimate = .fitted),
  rsq(m3int_test_aug, truth = bmi, estimate = .fitted)) |>
  mutate(model = c("m_1", "m_2", "m_3", "m_1int",
                   "m_2int", "m_3int"))
I’ve hidden my calculations for RMSE and MAE here.
Results comparing all six models (testing)
bind_cols(testing_r2 |> select(model, rsquare = .estimate),
          testing_rmse |> select(rmse = .estimate),
          testing_mae |> select(mae = .estimate)) |>
  gt() |> tab_options(table.font.size = 20)
model   rsquare     rmse      mae
m_1     0.07161065  5.728520  4.428381
m_2     0.06517957  5.769484  4.476413
m_3     0.06555978  5.767872  4.476380
m_1int  0.03973444  6.025257  4.623948
m_2int  0.03635748  6.082435  4.706638
m_3int  0.03572974  6.089895  4.707316
Did the polynomial term in m_3 and m_3int improve our predictions?
Splines
A linear spline is a continuous function formed by connecting points (called knots of the spline) by line segments.
A restricted cubic spline is a way to build highly complicated curves into a regression equation in a fairly easily structured way.
A restricted cubic spline is a series of cubic polynomial pieces joined smoothly at the knots, constrained to be linear in the tails (beyond the boundary knots).
Such a spline gives us a way to flexibly account for non-linearity without over-fitting the model.
Restricted cubic splines can fit many different types of non-linearities.
Specifying the number of knots is all you need to do in R to get a reasonable result from a restricted cubic spline.
The most common choices are 3, 4, or 5 knots.
3 Knots, 2 degrees of freedom, allows the curve to “bend” once.
4 Knots, 3 degrees of freedom, lets the curve “bend” twice.
5 Knots, 4 degrees of freedom, lets the curve “bend” three times.
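As a sketch (using a made-up x, and assuming the rms package is loaded), you can check that rcs(x, k) builds k - 1 basis columns, which is where those degrees of freedom come from:

x <- runif(100, min = 1, max = 10)
ncol(rcs(x, 3))   ## 2 columns = 2 df
ncol(rcs(x, 4))   ## 3 columns = 3 df
ncol(rcs(x, 5))   ## 4 columns = 4 df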
A simulated data set
set.seed(4322021)

sim_data <- tibble(
  x = runif(250, min = 10, max = 50),
  y = 3 * (x - 30) - 0.3 * (x - 30)^2 + 0.05 * (x - 30)^3 +
    rnorm(250, mean = 500, sd = 70)
)

head(sim_data, 2)
# A tibble: 2 × 2
x y
<dbl> <dbl>
1 42.5 397.
2 35.9 414.
The sim_data, plotted.
Fitting Restricted Cubic Splines with lm and rcs
sim_linear <- lm(y ~ x, data = sim_data)
sim_poly2 <- lm(y ~ poly(x, 2), data = sim_data)
sim_poly3 <- lm(y ~ poly(x, 3), data = sim_data)
sim_rcs3 <- lm(y ~ rcs(x, 3), data = sim_data)
sim_rcs4 <- lm(y ~ rcs(x, 4), data = sim_data)
sim_rcs5 <- lm(y ~ rcs(x, 5), data = sim_data)
Looking at the Polynomial Fits
Looking at the Restricted Cubic Spline Fits
Fitting Restricted Cubic Splines with lm and rcs
For most applications, three to five knots strike a nice balance between complicating the model needlessly and fitting data pleasingly. Let’s consider a restricted cubic spline model for bmi based on fruit_day again, but now with:
in temp3, 3 knots, and
in temp4, 4 knots.
temp3 <- lm(bmi ~ rcs(fruit_day, 3), data = train_c4im)
temp4 <- lm(bmi ~ rcs(fruit_day, 4), data = train_c4im)
Spline models for bmi and fruit_day
Let’s try an RCS with 4 knots
m_4 <- lm(bmi ~ rcs(fruit_day, 4) + exerany + health,
          data = train_c4im)

m_4int <- lm(bmi ~ rcs(fruit_day, 4) + exerany * health,
             data = train_c4im)
Comparing 4 models including the exerany*health interaction…
mod     fruit    r.sq       adj.r.sq   sigma     df  AIC       BIC
m_1int  not in   0.1263202  0.1144064  6.006371   9  4315.682  4365.262
m_2int  linear   0.1375041  0.1244162  5.972330  10  4309.050  4363.137
m_3int  poly(2)  0.1375923  0.1231751  5.976561  11  4310.982  4369.576
m_4int  rcs(4)   0.1378531  0.1221061  5.980203  12  4312.779  4375.881
Tidied summary of m_4int coefficients
term                          est         se         t           p             lo90         hi90
(Intercept)                   28.5634270   1.543775  18.5023217  7.751118e-62   26.0205571  31.1062969
rcs(fruit_day, 4)fruit_day    -1.2119249   1.265428  -0.9577193  3.385566e-01   -3.2963080   0.8724582
rcs(fruit_day, 4)fruit_day'    2.3145244   5.822397   0.3975209  6.911125e-01   -7.2759887  11.9050376
rcs(fruit_day, 4)fruit_day''  -5.9458573  16.583638  -0.3585376  7.200562e-01  -33.2620308  21.3703162
exerany                       -1.3564307   1.553859  -0.8729432  3.830130e-01   -3.9159102   1.2030489
healthVG                      -0.6374547   1.628132  -0.3915251  6.955361e-01   -3.3192756   2.0443661
healthG                        2.7744698   1.614570   1.7183953  8.619565e-02    0.1149883   5.4339514
healthF                        7.6381022   1.774779   4.3036922  1.935601e-05    4.7147286  10.5614758
healthP                        9.0463580   2.564614   3.5273755  4.489322e-04    4.8219862  13.2707299
exerany:healthVG               1.5709516   1.796510   0.8744463  3.821948e-01   -1.3882172   4.5301205
exerany:healthG                0.3723381   1.799358   0.2069283  8.361299e-01   -2.5915211   3.3361972
exerany:healthF               -6.4655738   2.067528  -3.1272008  1.842739e-03   -9.8711559  -3.0599917
exerany:healthP               -6.4950803   3.022963  -2.1485805  3.203248e-02  -11.4744338  -1.5157268
m_4int Residual Plots
par(mfrow = c(1, 2)); plot(m_4int, which = c(1, 2))
m_4int Residual Plots
par(mfrow = c(1, 2)); plot(m_4int, which = c(3, 5))
How do models m_4 and m_4int do in testing?
model   rsquare     rmse      mae
m_1     0.07161065  5.728520  4.428381
m_2     0.06517957  5.769484  4.476413
m_3     0.06555978  5.767872  4.476380
m_4     0.06688129  5.763084  4.470968
m_1int  0.03973444  6.025257  4.623948
m_2int  0.03635748  6.082435  4.706638
m_3int  0.03572974  6.089895  4.707316
m_4int  0.03711199  6.076300  4.698624
I’ll note that there’s a fair amount of very repetitive code in the Quarto file to create that table.
What are our conclusions?
Next Week
Using the ols function from the rms package to fit linear regression models with non-linear terms.
Be sure to submit Lab 2 to Canvas by Tuesday 2024-01-30 at Noon.
Appendix: How the class4 and class4im data were built from the smart_ohio.csv data created in the Course Notes
Creating Today’s Data Set
url1 <- "https://raw.githubusercontent.com/THOMASELOVE/432-data/master/data/smart_ohio.csv"
smart_ohio <- read_csv(url1)

class4 <- smart_ohio |>
  filter(hx_diabetes == 0,
         mmsa == "Cleveland-Elyria",
         complete.cases(bmi)) |>
  select(bmi, inc_imp, fruit_day, drinks_wk,
         female, exerany, genhealth, race_eth,
         hx_diabetes, mmsa, SEQNO) |>
  type.convert(as.is = FALSE) |>
  mutate(ID = as.character(SEQNO - 2017000000)) |>
  relocate(ID)
Codebook for useful class4 variables
894 subjects in Cleveland-Elyria with bmi and no history of diabetes
bmi        (outcome) Body-Mass Index, in kg/m^2
inc_imp    income (imputed from grouped values), in $
fruit_day  average fruit servings consumed per day
drinks_wk  average alcoholic drinks consumed per week
female     sex: 1 = female, 0 = male
exerany    any exercise in the past month: 1 = yes, 0 = no
genhealth  self-reported overall health (5 levels)
race_eth   race and Hispanic/Latinx ethnicity (5 levels)

plus ID, SEQNO, hx_diabetes (all 0), and mmsa (all Cleveland-Elyria)
See Course Notes Chapter on BRFSS SMART data for variable details
Basic Data Summaries
Available approaches include:
summary
mosaic package’s inspect()
Hmisc package’s describe
all of which can work nicely in an HTML presentation, but none of them fit well on one of these slides.
Quick Histogram of each quantitative variable
Code for previous slide
p1 <- ggplot(class4, aes(x = bmi)) +
  geom_histogram(fill = "navy", col = "white", bins = 20)

p2 <- ggplot(class4, aes(x = inc_imp)) +
  geom_histogram(fill = "forestgreen", col = "white",
                 bins = 20)

p3 <- ggplot(class4, aes(x = fruit_day)) +
  geom_histogram(fill = "tomato", col = "white", bins = 20)

p4 <- ggplot(class4, aes(x = drinks_wk)) +
  geom_histogram(fill = "royalblue", col = "white",
                 bins = 20)

(p1 + p2) / (p3 + p4)
I also used #| warning: false among the plot’s code chunk options to avoid warnings about missing values, like this one for inc_imp:
Warning: Removed 120 rows containing non-finite values
Binary variables in raw class4
class4 |> tabyl(female, exerany) |> adorn_title()
exerany
female 0 1 NA_
0 95 268 20
1 128 361 22
female is based on biological sex (1 = female, 0 = male)
exerany comes from a response to “During the past month, other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?” (1 = yes, 0 = no, don’t know and refused = missing)
Any signs of trouble here?
I think the 1/0 values and names are OK choices.
Multicategorical genhealth in raw class4
class4 |> tabyl(genhealth)
genhealth n percent valid_percent
1_Excellent 148 0.165548098 0.16573348
2_VeryGood 324 0.362416107 0.36282195
3_Good 274 0.306487696 0.30683091
4_Fair 112 0.125279642 0.12541993
5_Poor 35 0.039149888 0.03919373
<NA> 1 0.001118568 NA
The variable is based on “Would you say that in general your health is …” using the five specified categories (Excellent -> Poor), numbered for convenience after data collection.
Don’t know / not sure / refused were each treated as missing.
How might we manage this variable?
Changing the levels for genhealth
class4 <- class4 |>
  mutate(health =
           fct_recode(genhealth,
                      E = "1_Excellent",
                      VG = "2_VeryGood",
                      G = "3_Good",
                      F = "4_Fair",
                      P = "5_Poor"))
Might want to run a sanity check here, just to be sure…
Checking health vs. genhealth in class4
class4 |> tabyl(genhealth, health) |> adorn_title()
health
genhealth E VG G F P NA_
1_Excellent 148 0 0 0 0 0
2_VeryGood 0 324 0 0 0 0
3_Good 0 0 274 0 0 0
4_Fair 0 0 0 112 0 0
5_Poor 0 0 0 0 35 0
<NA> 0 0 0 0 0 1
OK. We’ve preserved the order and we have much shorter labels. Sometimes, that’s helpful.
Multicategorical race_eth in raw class4
class4 |> count(race_eth)
# A tibble: 6 × 2
race_eth n
<fct> <int>
1 Black non-Hispanic 167
2 Hispanic 27
3 Multiracial non-Hispanic 19
4 Other race non-Hispanic 22
5 White non-Hispanic 646
6 <NA> 13
“Don’t know”, “Not sure”, and “Refused” were treated as missing.
What is this variable actually about?
What is the most common thing people do here?
What is the question you are asking?
Collapsing race_eth levels might be rational for some questions.
We have lots of data from two categories, but only two.
Systemic racism affects people of color in different ways across these categories, but also within them.
Is combining race and Hispanic/Latinx ethnicity helpful?
It’s hard to see the justice in collecting this information and not using it in as granular a form as possible, though this leaves some small sample sizes. There is no magic number for “too small a sample size.”
Most people identified themselves in one of the categories.
These data are not ordered, and (I’d argue) ordering them isn’t helpful.
Regression models are easier to interpret, though, if the “baseline” category is a common one.
Resorting the factor for race_eth
Let’s sort all five levels, from most observations to least…
class4 <- class4 |>
  mutate(race_eth = fct_infreq(race_eth))

class4 |> tabyl(race_eth)
race_eth n percent valid_percent
White non-Hispanic 646 0.72259508 0.73325766
Black non-Hispanic 167 0.18680089 0.18955732
Hispanic 27 0.03020134 0.03064699
Other race non-Hispanic 22 0.02460850 0.02497162
Multiracial non-Hispanic 19 0.02125280 0.02156640
<NA> 13 0.01454139 NA
Not a perfect solution, certainly, but we’ll try it out.
“Cleaned” Data and Missing Values
class4 <- class4 |>
  select(ID, bmi, inc_imp, fruit_day, drinks_wk,
         female, exerany, health, race_eth, everything())

miss_var_summary(class4)
# A tibble: 13 × 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 inc_imp 120 13.4
2 exerany 42 4.70
3 fruit_day 41 4.59
4 drinks_wk 39 4.36
5 race_eth 13 1.45
6 health 1 0.112
7 genhealth 1 0.112
8 ID 0 0
9 bmi 0 0
10 female 0 0
11 hx_diabetes 0 0
12 mmsa 0 0
13 SEQNO 0 0
Single Imputation Approach?
set.seed(43203)

class4im <- class4 |>
  select(ID, bmi, inc_imp, fruit_day, drinks_wk,
         female, exerany, health, race_eth) |>
  data.frame() |>
  impute_cart(health ~ bmi + female) |>
  impute_pmm(exerany ~ female + health + bmi) |>
  impute_rlm(inc_imp + drinks_wk + fruit_day ~
               bmi + female + health + exerany) |>
  impute_cart(race_eth ~ health + inc_imp + bmi) |>
  tibble()

prop_miss_case(class4im)
Saving the tidied data
Let’s save both the unimputed and the imputed tidy data as R data sets.
write_rds(class4, "c04/data/class4.Rds")
write_rds(class4im, "c04/data/class4im.Rds")
To reload these files, we’ll use read_rds().
The main advantage here is that we’ve saved the whole R object, including all characteristics that we’ve added since the original download.
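As a quick sanity check (a sketch, assuming the files above were just written), the reloaded object should be identical to the one we saved, factor levels and all:

class4_reloaded <- read_rds("c04/data/class4.Rds")
identical(class4_reloaded, class4)   ## should be TRUE
levels(class4_reloaded$health)       ## "E" "VG" "G" "F" "P" still intact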